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Abstract: An algorithm for the direct inversion of the linear systems arising from Nystrom 
discretization of integral equations on one-dimensional domains is described. The method 
typically has O(N) complexity when applied to boundary integral equations (BIEs) in the plane 
with non-oscillatory kernels such as those associated with the Laplace and Stokes' equations. 
The scaling coefficient suppressed by the "big-O" notation depends logarithmically on the 
requested accuracy. The method can also be applied to BIEs with oscillatory kernels such 
as those associated with the Hclmholtz and Maxwell equations; it is efficient at long and 
intermediate wave-lengths, but will eventually become prohibitively slow as the wave-length 
decreases. To achieve linear complexity, rank deficiencies in the off-diagonal blocks of the 
coefficient matrix are exploited. The technique is conceptually related to the H- and TL 2 - 
matrix arithmetic of Hackbusch and co-workers, and is closely related to previous work on 
\Q ' Hierarchically Semi- Separable matrices. 
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r ~7 1 ' 1. Introduction 

1.1. Problem formulation. The paper describes techniques for numerically solving equations 
of the type 

(1.1) a(t)q(t)+ [ b(t,t')q(t')dt' = f(t), t G [0, T], 
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where I = [0, T] is an interval on the line, and where a : / —> R and b : I X / — > R are given 
functions. We observe that a boundary integral equation (BIE) such as 

(1.2) a(x)q(x) + J b(x,x')q(x')dl(x') = f(x), x G V, 



where T is a simple curve in the plane takes the form (1.1) upon parameterization of the curve. 
The case of a domain T that consists of several non-connected simple curves can also be handled. 
Upon discretization, equation (1.1) takes the form 



(1.3) Aq = f 

where A is a dense matrix of size, say, N x N. When N is large, standard practice for rapidly 
solving a system such as (1.3) is to use an iterative solver (such as GMRES, conjugate gradients, 
etc.) in which the matrix-vector multiplications are accelerated via a "fast" method such as the 
Fast Multipole Method (FMM) [13], panel clustering [15], Barnes-Hut [2], etc. When the integral 
equation (1.1) is a Fredholm equation of the second kind, the iteration typically converges 
rapidly, and a linear solver of effectively O(N) complexity results. In contrast, this papers 
reviews and extends a number of recently developed direct solvers that in a single pass compute 
a data-sparse representation of a matrix S (a "solution operator") that satisfies 



Once a representation of S is available, the solution of (1.3) is of course easily constructed: 
(1.4) qnSf. 

We will demonstrate that in many important environments (such as, e.g., the BIEs associated 
with Laplace's equation in the plane), the matrix S can be constructed in O(N) operations. 
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1.2. Applications. The direct solver presented is applicable to most boundary integral equa- 
tions associated with the classical boundary value problems of mathematical physics (Laplace, 
elasticity, Helmholtz, Yukawa, Stokes, etc.) with the two important exceptions that it is not 
efficient for (1) problems involving highly oscillatory kernels such as Helmholtz equation at 
short wavelengths, and (2) domain boundaries that tend to "fill space" in the sense illustrated 
in Figure 1.1. We will demonstrate that high accuracy and speed can be maintained even for 
non-smooth boundaries. 

The direct solver is also applicable to many integral equations of the form (1.1) that arise in 
the analysis of special functions [28] , in evaluating conformal maps [24] , and in the analysis of 
two-point boundary value problems [27]. 

1.3. Advantages of direct solvers. Direct solvers offer several advantages over iterative ones: 

Speed-up by large factors for problems involving multiple right hand sides: In many situations, 
an equation such as (1.1) needs to be solved for several different data functions /. Iterative 
techniques can only to a limited extent take advantage of the fact that the operator is the same 
in each solve. For a direct method, on the other hand, each solve beyond the first simply involves 
applying a pre-computed inverse to a vector. The time required for applying the (compressed) 
inverse to a vector is typically much smaller than even the time required for a single application 
of the original operator using standard techniques. 

The ability to solve relatively ill-conditioned problems: Direct solvers allow for the rapid and 
accurate solution of linear systems involving relatively ill-conditioned matrices. In the context 
of boundary value problems, such ill-conditioning can be caused by physical ill-conditioning (as 
observed, e.g., when solving the equations of elasticity on domains with high aspect ratios, or 
when solving scattering problems near a resonant frequency), but may also be introduced as 
a side-effect of the mathematical formulation (e.g. when a formulation based on a Fredholm 
equation of the first kind is used, or when a model of a scattering problem introduces so called 
"spurious" resonances). 

Increased reliability: Direct methods are inherently more robust than iterative methods. This 
point is less important in an academic setting where there is often time to tweak a code until it 
performs well for a particular problem (for instance by designing a customized pre-conditioner) . 
However, it has proven difficult to design industrial codes using iterative methods and commer- 
cial software developers sometimes tend to shun iterative methods in favor of direct ones, even 
at significant cost in terms of speed. 

1.4. How the direct solver works. Letting e denote a user specified computational tolerance, 
the direct solver for (1.1) can be viewed as consisting of four steps: 

(i) Quadrature nodes and quadrature weights for a Nystrom discretization are created: The 
interval [0, T] is split into panels, and Gaussian nodes are placed on each panel. Customized 
quadrature weights are constructed using the method of [29] which ensures high accuracy even 
in the presence of weakly singular kernels (and for BIEs on domains with corners). 
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Figure 1.1. Contours for which the direct solver will not achieve 0(N) complexity. 
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(ii) Construction of the coefficient matrix: The matrix A in (1.3) is an N x TV matrix that is 
dense, but whose off-diagonal blocks are to high accuracy rank-deficient. We exploit this fact, 
and compute an approximant A approx which is stored in the data-sparse format very similar to 
the Hierarchically Semi- Separable (HSS) format of [26, 6]. 

(Hi) Inversion of the coefficient matrix: The approximant A a p prox of the coefficient matrix is 
inverted using a variation of the technique of [27, 21] to produce the solution operator S = 
^approx- The inversion is exact up to round-off errors. 

(iv) Application of the approximate inverse: The solution operator S is applied to the given data 
vector / to produce the solution q, cf. (1.4). 

Each of the four steps typically requires 0{N) work when applied to the standard equations 
of mathematical physics (with the two exceptions mentioned in Section 1.2). The constants of 
proportionality depend on the specific environment, but in general, Step 2 is the most expensive. 
The cost of Step (iv) is tiny, meaning that the proposed procedure is particularly effective in 
environments where a sequence of equations with the same coefficient matrix need to be solved. 

Remark 1.1. The computations in steps 3 and 4 are independent of the specific problem being 
solved, and can be implemented as "black-box" codes. 

1.5. Relationship to earlier work. The general idea of exploiting rank-deficiencies in the 
off-diagonal blocks of the matrix A in (1.3) underlies several "fast" algorithms (e.g. the Fast 
Multipole Method [13], panel clustering [15], Barnes-Hut [2]) for executing the matrix- vector 
multiply in an iterative solver. The observation that such rank-deficiencies can be also used in 
a systematic way to execute matrix inversion, matrix-matrix multiplies, matrix factorizations, 
etc., was made in the early work on %-matrices by Hackbusch and co-workers [16]. The basic 
version of these methods have 0(N(logN) p ) complexity for some small integer p. Certain 
operations were later accelerated to O(N) complexity in the context of 'H 2 -matrices, see [4] and 
the references therein. 

More specifically, the direct solver described in this paper is an evolution of the scheme of 
[21], which in turn draws on the earlier work [3, 23, 27]. Since [21] appeared, the algorithms 
have been significantly improved, primarily in that the compression and inversion steps have 
been separated. In addition to making the presentation much clearer, this separation leads to 
several concrete improvement, including: 

Improved versatility: Separating the task of compression from the task of inversion makes it 
much easier to apply the direct solver to new applications. If a BIE with a different kernel is to 
be solved, a slight modification of the compression step (Step (ii) in Section 1.4) is sufficient. It 
also opens up the possibility of combining the direct solver with generic compression techniques 
based on randomized sampling, e.g., those described in [20]. 

Improved quadratures: The version of the algorithm described in this paper is compatible with 
the quadratures of [17, 5] which enable the handling of BIEs defined on domains with corners, 
and the quadratures of [29] which simplify the handling of singular kernels. 

Improved theoretical analysis: The direct solver is in the present paper expressed transparently 
as a telescoping matrix factorization. This allows for a simplified error and stability analysis, as 
illustrated by, e.g., Lemma 3.1 and Corollary 3.2. 

Improved interoperability with other data-sparse matrix formats: The new version of the algo- 
rithm makes it clear that the data-sparse format used to represent both the coefficient matrix 
and its inverse are essentially identical to the Hierarchically Semi-Separable (HSS) format of 



1 



[26, 6]. This opens up the possibility of combining the compression techniques described in this 
paper with recently developed inversion and factorization algorithms for HSS matrices [8]. 

Remark 1.2. The paper uses the terms "block separable" (BS) and "hierarchically block sep- 
arable" (HBS). The HBS format is essentially identical to the HSS format. The terms BS and 
HBS were introduced for local purposes only since they clarify the description of the algorithm. 
There is no intention to replace the well-established term "HSS." 

1.6. Outline. Section 2 introduces notation and reviews the Nystrom discretization method 
for integral equations. Section 3 describes an accelerated direct solver based on a simplistic 
tessellation of an N x N matrix A into p x p blocks in such a way that all off-diagonal blocks 
are rank deficient. This method has complexity 0(p~ 2 N s + p 3 A; 3 ) where k is the rank of the 
off-diagonal blocks. To attain better asymptotic complexity, a more complicated hierarchical 
tessellation of the matrix must be implemented. This data structure is described in Section 4, 
and an 0(N k 2 ) inversion technique is then described in Section 5. Section 6 describes efficient 
techniques for computing the data-sparse representation in the first place. Section 7 describes 
some numerical experiments, and Section 8 describes possible extensions of the work. 



This section introduces notation, and briefly reviews some known techniques. 

2.1. Notation. We say that a matrix U is orthonormal if its columns form an orthonormal set. 
An orthonormal matrix U preserves geometry in the sense that |U x\ = \x\ for every vector x. 
We use the notation of [10] to denote submatrices: If A is an m x n matrix with entries A(i,j), 
and if / = 12, . . . , i p ] and J = [ji, j'2, • • • , j q ] are two index vectors, then the associated 
p x q submatrix is expressed as 



2.2. The Interpolatory Decomposition (ID). An m x n matrix B of rank k admits the 
factorization 



where J = [ji, . . . , jf.] is a vector of integers such that 1 < ji < m, and U is a m x k matrix 
that contains the k x k identity matrix 1^ (specifically, U(J, : ) = Moreover, no entry of U 
is larger than one. Computing the ID of a matrix is in general combinatorially expensive, but if 
the restriction on element size of U is relaxed slightly to say that, for instance, each entry of U 
is bounded by 2, then very efficient schemes are available. See [14, 9] for details. 

2.3. Nystrom discretization of integral equations in one dimension. In this section, we 
very briefly describe some variations of the classical Nystrom method for discretizing an integral 
equation such as (1.1). The material is well known and we refer to [1] for details. 

For an integral equation with a smooth kernel k(t,t'), the Nystrom method is particularly 
simple. The starting point is a quadrature rule for the interval [0,T] with nodes {U}^ C [0, T] 
and weights {u3i\f =l such that 



2. Preliminaries 



'il.il ' ' ' a iijg 



A(J,J) = 



For column- and row-submatrices, we use the standard abbreviations 

A(: ,J) = A([l, 2, m],J), and A(I, : ) = A(/, [1, 2, . . . 



n}). 



B = U B(J, : ) 




i = 1, 2 



N. 
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Then the discretized version of (1.1) is obtained by enforcing that 

n 

(2.1) a{t i )q{U) + ^b(t i ,t j )q(t j )ujj = f(U), i = l,2,...,N. 

j=i 

We write (2.1) compactly as 

Aq = /, 

where A is the N x N matrix with entries 

A(i, j) = Sij a(ti) + b(U, tj)uj, i, j = 1, 2, 3, . . . , N. 
where / is the vector with entries 

f(i) = f(ti), i = l, 2, 3, ...,N, 
and where q is the approximate solution which satisfies 

g(z) «<?(*;), * = 1, 2, 3, AT. 

We have found that using a composite quadrature rule with a 10-point standard Gaussian 
quadrature on each panel is a versatile and highly accurate choice. 

Remark 2.1 (Singular kernels). Some of the numerical examples described in Section 7 involve 
kernels with logarithmically singular kernels, 

k(t,t') ~log|i-£'|, as <'->•*. 

A standard quadrature rule designed for smooth functions would lose almost all accuracy on the 
panels where t and t' is close, but this can be remedied by modifying the matrix entries near 
the diagonal. For instance, when Gaussian quadrature nodes are used, the procedure described 
in [29] gives very accurate results. Alternatively, the Rokhlin-Kapur [18] procedure starts with 
a standard trapezoidal rule and modifies the weights near the end points to achieve high order 
convergence. This is a simpler method than the modified Gaussian rule of [29] but typically also 
produces lower accuracy. 

Remark 2.2 (Contours with corners). Discretizing an integral equation such as (1.2) can be 
challenging if the contour T is not smooth. When x € T is a corner point, the function x' i— > 
b(x, x') typically has a singularity at x. It has been demonstrated [17, 5] that in many cases 
of practical interest, it is nevertheless possible to use standard quadrature weights designed for 
smooth functions, as long as the discretization is locally refined near the corner. The drawback is 
that such refinement can increase the system size in an undesirable way but as [17] demonstrates, 
the system size can be reduced via a local pre-computation. In this paper, we demonstrate that 
it is alternatively possible to use general purpose direct solvers to achieve the same effect. 



3. Inversion of block separable matrices 



In this section, we define what it means for a matrix to be "block separable" and describe a 
simple technique for inverting such a matrix. 

Let A be an np x np matrix that is blocked into p x p blocks, each of size n x n: 



(3.1) 



Di A lj2 Ai, 3 
A 2 ,i D 2 A 2i3 



A p ,i A. 



p,2 «p,3 



D, 
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We say that A is "block separable" with "block-rank" k if for r = 1, 2, . . . , p, there exist n x k 
matrices U r and V r such that each off-diagonal block of A admits the factorization 



(3.2) 



A, 



n x n 



I J A V* 

n x k k x k k x n 



a,T G {1, 2, ... , p}, cr ^ r. 



Observe that the columns of U CT must form a basis for the columns of all off-diagonal blocks in 
row a, and analogously, the columns of V r must form a basis for the rows in all the off-diagonal 
blocks in column r. When (3.2) holds, the matrix A admits a block factorization 



(3.3) 
where 



A = U A V* 

np x np np x kp kp x kp kp x np 



+ 



np x np 



U = diag(Ui, U 2 , . . . , U p ), V = diag(Vi, V 



. v P ), 



D = diag(Di, D 2 , 



D 



pj, 



and 






Al2 


Al 3 


A 2 i 





A 23 


A31 


A 32 






The block structure of formula (3.3) for p = 4 is illustrated below: 

A = U A V* + 



D 



(3.4) 



The idea is that by excising the diagonal blocks from A, we obtain a rank-deficient matrix A — D 
that can be factored with block diagonal flanking matrices: A — D = U_ AV*. 

The inverse of a block-separable matrix can rapidly be constructed using the following simple 
variation of the classical Sherman- Morrison- Woodbury formula: 

Lemma 3.1. Suppose that A is an N x N invertible matrix. Suppose further that K is a positive 
integer such that K < N , that A admits the decomposition 



(3.5) 



A = U A V* 

N x N N x K K x K K xN 



+ 



D, 

N x N 



and that the matrices D, (V* D" 1 U), and (A + (V* D" 1 U)" 1 ) are invertible. Then 



(3.6) 

where 

(3.7) 

(3.8) 

(3.9) 

(3.10) 



E(A+D)~ 1 F* + G, 



D 
E 
F 

G 



(V'D^U) -1 , 
D" 1 U D, 
(DV* D" 1 )*, 
D" 1 - D~ 1 UDV*D 



-i 



When A is block-separable, (3.5) holds with block diagonal matrices U, V, and D. The 
matrices D, E, F, and G can then be evaluated rapidly, and Lemma 3.1 can be said to reduce 
the task of inverting the np x np matrix A, to the task of inverting the kp x kp matrix A + D. 



Proof of Lemma 3.1: Consider the equation 

(3.11) (UAV* + D)g = ii. 

We will prove that (3.6) holds by proving that the solution q of (3.11) is the right hand side of 
(3.6) applied to u. First we set 

(3.12) q = y*q- 
Then (3.11) can be written 

(3.13) UAq + Dq = u. 
Solving (3.13) for q and inserting the result in (3.12), we obtain 

(3.14) (/ + V*D~ 1 U A)q = V* D' 1 u. 

=D-i 

Multiplying (3.14) by D we find that 

(3.15) (D + A)q = py*D-\ u. 

=F* 

Now note that from (3.13) it also follows that 

(3.16) q = -D" 1 U Aq + D' 1 u. 
From (3.15) we know that 

(3.17) Aq= -Dq + F*u. 
Inserting (3.17) into (3.16), we obtain 

(3.18) q = -D- 1 U (-D q + F* u) + D" 1 u = D" 1 U D q + (D" 1 - D" 1 U F*) u. 



■■v — 

( 



Solving (3.15) for q and inserting the result in (3.18), we obtain the expression (3.6). □ 

The technique provided by the lemma is a useful tool in its own right. It can often reduce the 
cost of inverting an N x N matrix from the 0(N 3 ) cost of Gaussian elimination, to 0(iV 18 ), 
see Remark 3.1. Even more significant is that for many matrices, including the ones under 
consideration in this paper, the process described can be continued recursively which leads to 
an O(N) inversion scheme. The required hierarchical representations of matrices are described 
in Section 4, and the O(N) inversion scheme is given in Section 5. 

Remark 3.1. To assess the computational cost of applying Lemma 3.1, suppose that A is a BS 
matrix whose px p blocks of size nxn satisfy (3.2). Then evaluating the factors D, E, F, and G 
requires 0(pn 3 ) operations. Evaluating (A + D) _1 requires 0(p 3 k 3 ) operations. The total cost 
Tbs of evaluating the factors in formula (3.6) therefore satisfies: 

(3.19) T BS ~pn 3 + p 3 k 3 . 

The cost (3.19) should be compared to the 0(p 3 n 3 ) cost of evaluating A -1 directly. To elucidate 
the comparison, suppose temporarily that we are given a matrix A of fixed size N x N, and 
can choose how many blocks p we wish to partition it into. Then n ~ N/p, and the total cost 
satisfies 

(3.20) T BS ~ p~ 2 N 3 + p 3 k 3 . 

If the rank of interaction k is independent of the block size, we can set p ~ iV 3 / 5 , whence 

(3.21) T BS ~ A; 3 iV 9 / 5 . 
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In practice, the numerical rank of the off-diagonal blocks typically increases slightly as the block 
size n is increased, but the increase tends to be moderate. For instance, for the matrices under 
consideration in this paper, one typically sees k ~ log(ra) ~ log(N/p). In such an environment, 
setting p ~ iV 3 / 5 (logiV)" 3 / 5 transforms (3.20) to, at worst, 

T BS ~ (logAf) 6/5 iV 9/5 . 

The calculations in this remark do not include the cost of actually constructing the factors U, 
V, D. For a general matrix, this cost is 0(kN 2 ), but for the matrices under consideration in 
this paper, techniques with better asymptotic complexity are described in Section 6. 

We close by showing that when the matrix A is symmetric positive definite, the assumption 
in Lemma 3.1 that certain intermediate matrices are invertible can be omitted: 

Corollary 3.2. Let A be a symmetric positive definite (spd) matrix that admits a factorization 

( ^oo) A = U A U* + D, 

{ } N x N NxKKxKKxN N x N 

where ker(U) = {0} and D is a block diagonal submatrix of A. Then the matrices D, (U* D^ 1 U), 
and A + (U* D _1 U) _1 are spd (and hence invertible). 

Proof of Corollary 3.2: That D is spd follows immediately from the fact that it is a block 
diagonal submatrix of a spd matrix. 

To show that U* D" 1 U is spd we pick any x ^ 0, set y = D^ 1 U x, observe that y / since 
ker(U) = {0}, and then we find that (U* D _1 U£C, x) = (D^ 1 U, Ux) = (y, Dy) > since D is spd. 

It remains only to prove that A + (U*D~ 1 U) _1 is spd. To this end, define D and E via 

D = (ITD^U)- 1 
E = D^UD. 

Then 

A + D = D A D" 1 + D" 1 ) D = D (ll* D" 1 U A U*D~ 1 U + D" 1 ) D 

= D (U* D" 1 (A - D)D- 1 U + U*D _1 U) D = D U* D" 1 AD" 1 UD = E*AE. 

That A + (U*D _1 U) _1 is spd now follows since ker(E) = {0} and A is spd. □ 

Remark 3.2. The proof of Corollary 3.2 demonstrates that the stability of the method can 
readily be assessed by tracking the conditioning of the matrices E. If these matrices are close to 
orthogonal (i.e. all their singular values are similar in magnitude) then the compressed matrix 
A + D has about the same distribution of singular values as A since A + D = E* A E. 



4. Hierarchically block separable matrices 

In this section, we define what it means for an N x N matrix A to be HBS. Section 4.1 
informally describes the basic ideas. Section 4.2 describes a simple binary tree of subsets of 
the index vector [1, 2, . . . , N]. Section 4.3 provides a formal definition of the HBS property. 
Section 4.4 describes how an HBS matrix can be expressed a telescoping factorization. Section 
4.5 describes an O(N) procedure for applying an HBS matrix to a vector. 







4.1. Heuristic description. The HBS property first of all requires A to be BS. Supposing that 
A consists of 8 x 8 blocks, this means that A admits a factorization, cf. (3.4): 

U (3) A® (V (3) )* + D (3) 



A 



(4.1) 




The superscripts on the right-hand side of (4.1) indicate that the factorization is at "level 3." 
We next require the smaller matrix A^ 3 ) to be BS, and to admit the analogous factorization: 

AO) = (j(2) X(2) (y( 2 ))* + B< 2 > 



(4.2) 



In forming (4.2), we reblocked the matrix A® by merging blocks in groups of four. The purpose 
is to "reintroduce" rank deficiencies in the off-diagonal blocks. In the final step in the hierarchy, 
we require that upon reblocking, A*- 2 -* is BS and admits a factorization: 

A(2) = u(D aW (V«)* + B« 

(4.3) 



Combining (4.1), (4.2), and (4.3), we find that A can be expressed as 

(4.4) a = y^(u( 2 )(u« b_(°) (v«)* + b«)(\/( 2 ))* + b( 2 ))(\/( 3 ))* + d( 3 ). 

The block structure of the right hand side of (4.4) is: 

y(3) y(2) y(l) g(0) (y(l))* g(l) (y( 2 ))* B_( 2 ) (V®)* . 



m 



In other words, the HBS property lets us completely represent a matrix in terms of certain block 
diagonal factors. The total cost of storing these factors is 0(N k), and the format is an example 
of a so-called "data-sparse" representation of the matrix. 

Remark 4.1. In describing the HBS property, we assumed that all off-diagonal blocks at all 
levels have the same rank k. We do this for notational clarity only. In practice, the minimal 
rank tends to vary slightly from block to block, and to moderately increase on the coarser levels. 
In the numerical examples in Section 7, all codes determine the rank adaptively for each block. 

4.2. Tree structure. The HBS representation of an N x N matrix A is based on a partition of 
the index vector I = [1, 2, . . . , N] into a binary tree structure. For simplicity, we limit attention 
to binary tree structures in which every level is fully populated. We let / form the root of the 
tree, and give it the index 1, l\ = I. We next split the root into two roughly equi-sized vectors I2 
and ^3 so that I\ = I2 U I3. The full tree is then formed by continuing to subdivide any interval 
that holds more than some preset fixed number n of indices. We use the integers £ = 0, 1, . . . , L 
to label the different levels, with denoting the coarsest level. A leaf is a node corresponding 
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Level 



Level 1 



Level 2 



Level 3 




h = [1, 2, . . . , 400] 



h = [1, 2, . . . , 200], 7 3 = [201, 202, . . . , 400] 



7 J J 4 = [1, 2, . . . , 100], h = [101, 102, . . . , 200], 



h = [1, 2, ...,50], J 9 = [51, 52, 100], 



Figure 4.1. Numbering of nodes in a fully populated binary tree with L = 3 
levels. The root is the original index vector / = I\ = [1, 2, . . . , 400]. 



to a vector that never got split. For a non-leaf node r, its children are the two boxes a\ and 
o"2 such that I T = I ai U I a , 2 , and r is then the parent of <ti and 02. Two boxes with the same 
parent are called siblings. These definitions are illustrated in Figure 4.1 

Remark 4.2. The HBS format works with a broad range of different tree structures. It is 
permissible to split a node into more than two children if desirable, to distribute the points in 
an index set unevenly among its children, to split only some nodes on a given level, etc. This 
flexibility is essential when A approximates a non-uniformly discretized integral operator; in 
this case, the partition tree it constructed based on a geometric subdivision of the domain of 
integration. The spatial geometry of a box then dictates whether and how it is to be split. The 
algorithms of this paper can easily be modified to accommodate general trees. 



4.3. Definition of the HBS property. We are now prepared to rigorously define what it 
means for an N x N matrix A to be hierarchically block seperable with respect to a given binary 
tree T that partitions the index vector J = [1, 2, . . . , N]. For simplicity, we suppose that the 
tree has L fully populated levels, and that for every leaf node r, the index vector I T holds 
precisely n points, so that iV = n2 L . Then A is HBS with block rank k if the following two 
conditions hold: 

(1) Assumption on ranks of off-diagonal blocks at the finest level: For any two distinct leaf nodes 
t and t' , define the n x n matrix 

(4.5) k T y = k(I T J T ,). 

Then there must exist matrices U T , V r /, and A T T > such that 

A r )T i = U T A Tr / V*,. 
n x n nxkkxkkxn 



(2) Assumption on ranks of off- diagonal blocks on level £ = L — 1, L — 2, . . . , 1: The rank 
assumption at level £ is defined in terms of the blocks constructed on the next finer level £ + 1: 
For any distinct nodes r and r' on level £ with children o\ , 02 and a[ , a' 2 , respectively, define 



(4.7) 



'CT2,<T 1 0"2,0" 2 

Then there must exist matrices U r , V T /, and A T T > such that 



(4.f 



K,T> 

2k x 2k 



V*,. 



2/c x k k x Jv x 2/c 
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Name: 


Size: 


Function: 


For each leaf 
node r: 


n 

v T 


n x n 
n x k 
n x k 


ine oiagonai diock a^j t , i t j. 

Basis for the columns in the blocks in row r. 

Basis for the rows in the blocks in column r. 


For each parent 
node r: 


u r 
v T 


2k x 2k 
2k xk 
2k xk 


Interactions between the children of r. 

Basis for the columns in the (reduced) blocks in row r. 

Basis for the rows in the (reduced) blocks in column r. 



Figure 4.2. An HBS matrix A associated with a tree T is fully specified if the 
factors listed above are provided. 



The two points above complete the definition. An HBS matrix is now fully described if the 
basis matrices U T and V T are provided for each node r, and in addition, we are for each leaf r 
given the n x n matrix 

(4.9) D r = A(J r ,/ T ), 

and for each parent node r with children a\ and a 2 we are given the 2k x 2k matrix 

k n 



(4.10) 



B r 







Observe in particular that the matrices A CTliCT2 are only required when {a\,a2} forms a sibling 
pair. Figure 4.2 summarizes the required matrices. 

Remark 4.3. The definition of the HBS property given in this section is flexible in the sense 
that we do not enforce any conditions on the factors U T , V r , and A r T / other than that (4.6) and 
(4.8) must hold. For purposes of numerical stability, further conditions are sometimes imposed. 
The perhaps strongest such condition is to require the matrices U r and V T ' in (4.6) and (4.8) be 
orthonormal, see e.g. [26, 6] (one can in this case require that the matrices A T T / be diagonal, so 
that (4.6) and (4.8) become singular value decompositions) . If a "general" HBS matrix is given, 
it can easily be converted to this more restrictive format via, e.g., Algorithm 1. A choice that we 
have found highly convenient is to require (4.6) and (4.8) to be interpolatory decompositions (see 
Section 2.2). Then every U r and V T / contains a k x k identity matrix (which greatly accelerates 
computations), and each A r r / is a submatrix of the original matrix A. 



Remark 4.4. The definition (4.6) transparently describes the functions of the basis matrices 
U r and V r whenever r is a leaf node. The definition (4.8) of basis matrices for non-leaf nodes is 
perhaps less intuitive. Their meaning can be made clearer by defining what we call "extended" 
basis matrices U^. xtend and V^ xtend . For a leaf node r we simply set 



U 



extend 



and 



^extend \j 



For a parent node r with children a\ and C2> we set 



U 



extend 



I I extend 
01 







|j extend 



and 



V 



extend 



wextend 



Then for any distinct nodes r and r' on level £, we have 



A(J T ,/ r 
n2 L ~ l x n2 L ~ 



y extend ; ^wextend ^* 

n 2 L ~ l x k k x k k x n 2 L ~ £ 





wextend 
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Algorithm 1 (reformatting an HBS matrix) 



Given the factors U T; V r , A fTli(J2; D T of an HBS matrix in general format, this algorithm 
computes new factors (that overwrite the old) such that all U T and\/ T are orthonormal, 
and all A CT1)(T2 are diagonal. 

loop over levels, finer to coarser, £ = L — 1, L — 2, ...,0 
loop over all parent boxes r on level £, 
Let o"i and cr 2 denote the children of r. 
[Ui, RJ = qr(U CT1 ). 
[V 1 ,S 1 ] = qr(V ffl ). 



[U 2 , 
[V 2 , 



R 2 ] 
S 2 ] 



qr(v ff2 ) 



[Xi, £12, Y2] - 
V CT1 <-ViYi. 

Bcti(T2 ^~ ^12- 

if Z > 



svd(Ri A CT1 CT2 S2) 



[X 2 , E 2 i, Yi] = svd(R 2 

u 



C2 



-■0-2(7! 



u r < 

end if 
end loop 
end loop 



XJ Ri 



X2 R 2 



U, 



u 2 x 2 . 

V 2 Y 2 . 
£21 • 

YfSi 






Y^S 2 



4.4. Telescoping factorizations. In the heuristic description of the HBS property in Section 
4.1, we claimed that any HBS matrix can be expressed as a telescoping factorization with block 
diagonal factors. We will now formalize this claim. Given the matrices defined in Section 4.3 
(and summarized in Figure 4.2), we define the following block diagonal factors: 

(4.11) D w = diag(D r : r is a box on level I), I = 0, 1, . . . , L, 

(4.12) U w = diag(U T : r is a box on level £), £ = 1, 2, . . . , L, 

(4.13) = diag(V r : r is a box on level f), £ = 1, 2, . . . , L, 

(4.14) B w = diag(B r : r is a box on level £), £ = 0, 1, . . . , L - 1, . 

Furthermore, we let Aw denote the block matrix whose diagonal blocks are zero, and whose 
off-diagonal blocks are the blocks A T T i for all distinct r, r' on level £. With these definitions, 

a = yW AW (yW)* + d( l » ; 

^ ' n2 L xn2 L n 2 L x fc 2 L fc 2 L x fc 2 L jfc 2 L x n 2 L n2 L x n2 L 

for £ = L — 1, L — 2, . . . , 1 we have 

A^ +1 ) = y« AW (yW)* + bW ; 

1 j & 2 £+1 x fc 2 m k 2 e+1 xk2 l k 2 e x k 2 e k 2 £ x k 2 e+1 k 2 i+1 x k 2 e+1 
and finally 

(4.17) aW = B_(°). 

4.5. Matrix- vector multiplication. The telescoping factorizations in Section 4.4 easily trans- 
late into a formula for evaluating the matrix- vector product u = A q once all factors in an 
HBS representation have been provided. The resulting algorithm has computational complexity 
0(N k) (assuming that n = 0(h)), and is given as Algorithm 2. 
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Algorithm 2 (HBS matrix-vector multiply) 

Given a vector q and a matrix A in HBS format, compute u = Aq. 

loop over all leaf boxes r 

q T =V*q(I T ). 
end loop 



loop over levels, finer to coarser, I = L — 1, L 
loop over all parent boxes r on level £, 
Let o"i and 02 denote the children of r. 

Ht — V T 

end loop 
end loop 



ui = 

loop over all levels, coarser to finer, I = 1, 
loop over all parent boxes r on level I 



u 
u 







B 



Ol, (72 





•> • 


.., L- 


- 1 


r. 














. . 





end loop 
end loop 



loop over all leaf boxes r 

u(I T ) = U r u T + D r g(/ T ' 
end loop 



5. Inversion of hierarchically block separable matrices 

In this section, we describe an algorithm for inverting an HBS matrix A. The algorithm 
is exact (in the absence of round-off errors) and has asymptotic complexity 0(Nk 2 ). It is 
summarized as Algorithm 3, and as the description indicates, it is very simple to implement. 
The output of Algorithm 3 is a set of factors in a data-sparse representation of A^ 1 which can 
be applied to a given vector via Algorithm 4. Technically, the scheme consists of recursive 
application of Lemma 3.1, and its derivation is given in the proof of the following theorem: 



Theorem 5.1. Let A be an invertible N x N HBS matrix with block-rank k. Suppose that at the 
finest level of the tree structure, there are 2 L leaves that each holds n points (so that N = n2 L ), 
and that n <2k. Then a data-sparse representation o/A -1 that is exact up to rounding errors 
can be computed in 0(N k 2 ) operations via a process given as Algorithm 3, provided that none 
of the matrices that need to be inverted is singular. The computed inverse can be applied to a 
vector in 0{N k) operations via Algorithm 4- 

Proof: We can according to equation (4.15) express an HBS matrix as 

A = U (L ) A (L) (V (L) )* + d( l ). 
Lemma 3.1 immediately applies, and we find that 

(5.1) A" 1 = (AW + D (i) ) _1 (F^y + G^, 
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where Ej L \ F^, D (L) and G (i) are denned via (3.7), (3.8), (3.9), and (3.10). 
To move to the next coarser level, we set I = L — 1 in formula (4.16) whence 



(5.2) 

We define 



A (L) + D (L) = U^" 1 ) A^" 1 ) (V^- 1 ))* + B^- 1 ) + D 



(L) 



B (L-l) + 5 



D T1 


B rir2 
















B Tir2 


D T2 
























B T3 T4 
















Br3T4 


Dr 4 
























Bt5T6 































where {n, r 2 , 
(5.3) 



, t 2 l} is a list of the boxes on level L. Equation (5.2) then takes the form 



A (L) + 5^) = u(L-i) (y{L-i)y + 5 



(We note that in (5.3), the terms on the left hand side are block matrices with 2 L x 2 L blocks, 
each of size k x k, whereas the terms on the right hand side are block matrices with 2 L ~ 1 x 2 L ~ 1 
blocks, each of size 2k x 2k.) Now Lemma 3.1 applies to (5.3) and we find that 

(A W + D (L) ) - 1 = E^ 1 ) (A^^ 1 ) + D (L_1 Y 1 (F( L ~i)y + G^^ 1 ) , 

where E (L " 1} , F^" 1 ), D (L " 1} and G (L " 1} are defined via (3.7), (3.8), (3.9), and (3.10). 

The process by which we went from step L to step L — 1 is then repeated to move up to 
coarser and coarser levels. With each step, the size of the matrix to be inverted is cut in half. 
Once we get to the top level, we are left with the task of inverting the matrix 



(5.4) 



AW + D 



(i) 



D 2 

63,2 



B2.3 



The matrix in (5.4) is of size 2 k x 2 k, and we use brute force to evaluate 

G(°) = Gi = 62 B2 ' 3 ' 
L B 3 , 2 D 3 

To calculate the cost of the inversion scheme described, we note that in order to compute the 

matrices , F^, G w , and fr ' on level t, we need to perform dense matrix operations on 2 
blocks, each of size at most 2 k x 2 k. Since the leaf boxes each hold at most 2 k points, so that 
2 L+1 k < N, the total cost is 



COST 



E 2 ' 



8 k 



This completes the proof. 



□ 



Remark 5.1. Algorithm 3 produces a representation of A" 1 that is not exactly in HBS form 
since the matrices G T do not have zero diagonal blocks like the matrices B T , cf. (4.10). However, 
a simple technique given as Algorithm 5 converts the factorization provided by Algorithm 3 
into a standard HBS factorization. If a factorization in which the expansion matrices are all 
orthonormal is sought, then further post-processing via Algorithm 1 will do the job. 
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Algorithm 3 (inversion of an HBS matrix) 



loop over all levels, finer to coarser, £ = L, L — 1, 
loop over all boxes r on level £, 
if t is a leaf node 

D T = D T 
else 

Let <Ti and 02 denote the children of r. 



R 

- C> 0"2,0'1 

end if 

D r = (V* D- 1 U T ) 



B 



D m 



E T = D~ 1 U T D T . 
F* = D r V* D" 1 . 
G T = D T - D- 1 U r D T V* D" 1 . 
end loop 
end loop 

D2 B2,3 
B39 D3 



Gi 



Remark 5.2. Algorithm 3 provides four formulas for the matrices {E T , F r , G r , D r } r . The task 
of actually computing the matrices can be accelerated in two ways: (1) Several of the matrix- 
matrix products in the formulas are recurring, and the computation should be organized so that 
each matrix-matrix product is evaluated only once. (2) When interpolatory decompositions are 
used, multiplications involving the matrices U r and V r can be accelerated by exploiting that 
they each contain a k x k identity matrix. 

Remark 5.3. The assumption in Theorem 5.1 that none of the matrices to be inverted is 
singular is undesirable. When A is spd, this assumption can be done away with (cf. Corollary 
3.2), and it can further be proved that the inversion process is numerically stable. When A is 
non-symmetric, the intermediate matrices often become ill-conditioned. We have empirically 
observed that if we enforce that U T = V T for every node (the procedure for doing this for non- 
symmetric matrices is described in Remark 6.1), then the method is remarkably stable, but we 
do not have any supporting theory. 

Remark 5.4. The assumption in Theorem 5.1 that the block-rank k remains constant across 
all levels is slightly unnatural. In applications to integral equations on ID domain, one often 
finds that the rank of interaction depends logarithmically on the number of points in a block. 
It is shown in [21] that inclusion of such logarithmic growth of the interaction ranks does not 
change the 0(N) total complexity. 



6. Computing the HBS representation of a boundary integral operator 

Section 4 describes a particular way of representing a class of "compressible" matrices in a 
hierarchical structure of block-diagonal matrices. For any matrix whose HBS rank is k, these 
factors can via straight-forward means be computed in 0(N 2 k) operations. In this section, we 
describe an 0(N k 2 ) technique for computing an HBS representation of the matrix resulting 
upon Nystrom discretization of a BIE. 
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Algorithm 4 (application of inverse) 

A -1 u using the compressed representation of A -1 



Given a vector u, compute q 
resulting from Algorithm 3. 

loop over all leaf boxes r 

u T = F*u(J r ). 
end loop 



loop over all levels, finer to coarser, £ = L, L — 1, 
loop over all parent boxes r on level £, 
Let a i and a 2 denote the children of r. 



end loop 
end loop 



U, 



Q2 
<?3 



Gi 



u 2 



loop over all levels, coarser to finer, £ = 1, 2, . . . , L 
loop over all parent boxes r on level ^ 
Let c"i and 02 denote the children of r. 



<?<x 2 

end loop 
end loop 



Ur. 



loop over all leaf boxes r 

g(J r ) = E r q T + G T u(I T 
end loop 



6.1. A basic compression scheme for leaf nodes. In this section, we describe how to 
construct for every leaf node r, interpolatory matrices U T and V r of rank k, and index vectors 
l| row) and 4 col) such that, cf. (4.6), 

(6.1) A(/ r ,/ T ,) = U r A(/( row ), /J ol) ) V;„ r + t'. 

The first step in the process is to construct for every leaf r a row of blocks R T and a column of 
blocks C r via 

R T = A(/ T , L T ), and C T = A(L r , J T ), 

where L T the complement of the index vector I T within the full index set, 

L T = {1, 2, 3, ...,N}\I T . 

The condition (4.6) implies that R r and C T each have rank at most k. We can therefore construct 
interpolatory decompositions 

(6.2) R r = U r R(J^ ow) ,: ), 

(6.3) C r = C(: , J^ col) )V;. 
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Algorithm 5 (reformatting inverse) 



Postprocessing the terms computed Algorithm 3 to obtain an inverse in the standard 
HBS format. 

loop over all levels, coarser to finer, £ = 0, 1, 2, . . . , L — 1 
loop over all parent boxes r on level £, 
Let o"i and a 2 denote the children of r. 
Define the matrices H11, B aij!T2 , B CT2i(Jl , 1^,2 so that 



Remark 6.1. It is often useful to use the same basis matrices to span the ranges of both R T 
and C*. In particular, this can make a substantial difference in the stability of the inversion 



Then clearly R T = U T R(J r , : ) and C T = C(: , J T ) U*. Enforcing symmetry can slightly increase 
the HSS-ranks, but typically in a very modest way. 

Remark 6.2. In applications, the matrices R T and C r are typically only approximately of rank 
k and the factorizations (6.2) and (6.3) are then required to hold only to within some preset 
tolerance e. Techniques for computing rank-revealing partial factorizations of this type are 
described in detail in [14, 9]. 

6.2. An accelerated compression scheme for leaf nodes. We will in this section demon- 
strate how to rapidly construct matrices U T , V T and index vectors Jr° w \ jf 00 ^ such that (6.2) 
and (6.3) hold. We focus on the construction of U r and jf row ^ since the construction of V r and 

Jr C °^ is analogous. While U r and J^ row) can in principle be constructed by directly computing 
an interpolatory decomposition of R T this is in practice prohibitively expensive since R r is very 
large. The way to get around this is to exploit analytic properties of the kernel to construct a 
much smaller matrix R(. sma11 ) whose columns span the column space of R r . Then we can cheaply 
form U r and jf row ' ) by factoring this smaller matrix. The process typically over-estimates the 
rank slightly (since the columns of R(. smd11 ) w jH span a slightly larger space than the columns of 
R r ) but this is more than compensated by the dramatic acceleration of the compression step. 
To formalize matters, our goal is to construct a small matrix R(. sma11 ) suc h that 




end loop 
end loop 



loop over all leaf boxes r 

D r = G T . 
end loop 



Now (6.1) holds if we set 



/(row) = J r ( J^row) ) and J(col) = j r (jH ) _ 




(6.5) 
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-p(proxy) 




(b) 



-p(far) I 



Figure 6.1. A contour T. (a) T T is drawn with a bold line, (b) The contour 
p(near) . g j rawil w ^h a thin solid line and r r far ^ with a dashed line. 



Then all we would need to do to compress r is to construct the interpolatory decomposition 
(6.6) R(. small) = U r R(small) ( -j^ow) ) . ) 

since (6.5) and (6.6) together imply (6.2). 

When constructing the matrix R^- small \ we distinguish between near field interaction and far 
field interactions. The near field interactions cannot readily be compressed, but this is not a 
problem since they contribute a very small part of R r . To define what we mean by "near" and 
"far," we need to introduce some notation. Let T T denote the segment of V associated with the 
node r, see Fig. 6.1. We enclose T T in a circle and then let r^?™*^ denote a circle with the same 
center but with a 50% larger radius. We now define r r far ^ as the part of T outside of r r proxy ^ 
and define r(. near ) a s the part of T inside r r proxy ^ but disjoint from T T . In other words 

r = r r u r(. ncar ) u rt far) 

forms a disjoint partitioning of T. We define L r near ^ and L r f ^ so that 

{1, 2, 3, . . . , N} = I T U 4 ncar ) U 4 far) 
forms an analogous disjoint partitioning of the index vector {1, 2, . . . , N}. We now find that 



(6.7) 



A(/ r ,4 ncar )) I A(I r ,4 fa 



n 



where l~l is a permutation matrix. We will construct a matrix R^ proxy ^ such that 
(6.8) 

and then we set 
(6.9) 



Ran(A(/ r ,4 far ))) C Ran(R(P roxy )) , 

p(small) _ A(7 rj ^(ncar)-j | p(proxy) 



That (6.5) holds is now a consequence of (6.7), (6.8) and (6.9). 

All that remains is to construct a small matrix R^ proxy ^ such that (6.8) holds. We describe the 
process for the single layer potential associated with Laplace's equation (for generalizations, see 
Remarks 6.4, 6.5, 6.6, and 6.7). Since we use Nystrom discretization, the matrix A(/ T , Lr^) 
in this case represents evaluation on r r of the harmonic potential generated by a set of point 

(far) 

charges on T\- . We know from potential theory that any harmonic field generated by charges 
outside ri proxy ^ can be generated by placing an equivalent charge distribution on 4 proxy \ Since 



we only need to capture the field to within some preset precision e, it is sufficient to place charges 

>.i= 



on a finite collection {zj}j =1 of points on 4 proxy ) 
R^(i,j) = log\ Xl 



'3 I ' 



. In other words, we set 
i€ I T , JG{1, 2, 3,..., J}. 
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The number of charges J that are needed on the external circle depends on the precision e 
required. In fact J = 0(log(l/e)) as e — > 0. We have found that using J = 50 points is sufficient 
to attain ten digits of accuracy or better. 

The construction of a small matrix c^" 13,11 ) th a t can fog use d to construct V T and J^ 00 ^ such 
that (6.3) holds is entirely analogous since the matrix C* is also a map of a charge distribution 
on to a potential field on T T . The only caveat is that the rows of C* must be scaled by 

the quadrature weights used in the Nystrom method. 

Remark 6.3. As an alternative to placing charges on the exterior circle, one could repre- 
sent the harmonic field generated on T T by an expansion in the cylindrical harmonic functions 
{r- 5 cos(j(9), r J sin(j'0)}y =o where (r, 9) are the polar coordinates of a point in the circle enclosing 
T r . The number of functions needed are about the same, but we found it easier to correctly 
weigh the two terms A(/ T , L^^) and R^ proxy ^ when using proxy charges on the outer circle. 

Remark 6.4 (Extension to the double layer kernel). The procedure described directly gener- 
alizes to the double layer kernel associated with Laplace's equation. The compression of R T is 
exactly the same. The compression of C* is very similar, but now the target field is the normal 
derivative of the set of harmonic potentials that can be generated by sources outside r^ proxy \ 

Remark 6.5 (Extension to Laplace's equation in M 3 ). The scheme described generalizes im- 
mediately to BIEs defined on surfaces in R 3 . The circles must be replaced by spheres, and the 
complexity is no longer linear, but the method remains competitive at low and intermediate 
accuracies [12]. 

Remark 6.6 (Extension to Helmholtz and other equations). The scheme described has been 
generalized to the single and double layer potentials associated with Helmholtz equation, see 
[21]. The only complication happens the frequency is close to a resonant frequency of the proxy 
circle. This potential problem is avoided by placing both monopoles and dipoles on the proxy 
circle. 

Remark 6.7 (Extension to integral equations on the line). The acceleration gets particularly 
simple for integral equations on a line with smooth non-oscillatory kernels. In this case, the 
range of A(I T , Lr^) can typically be represented by a short expansion in a generic set of basis 
functions such as, e.g., Legendre polynomials. 

6.3. Compression of higher levels. The method described in Section 6.2 rapidly constructs 
the matrices U T , V r and the index vector Jr° w \ J-f 00 ^ for any leaf node r. Using the notation 
introduced in Section 4.4, it computes the matrices U ( L >, VW, and A( L ). It is important to 
note that when the interpolatory decomposition is used, is in fact a submatrix of A, and 
is represented implicitly by specifying the relevant index vectors. To be precise, if r and r' are 
two nodes on level L — 1, with children a±, o<i and a^, a' 2 , respectively, then 



A-r.7 



A(l£ ow) , /(? ol) ) A(/£ ow) , ) 
A(/£ ow) , /(? ol) ) A(/£ ow) , /| ol) ) 



The observation that A^) is a submatrix of A is critical. It implies that the matrices LJ^ L 1 \ 
)J_ {L ~ l \ and A^" 1 ) can be computed using the strategy of Section 6.2 without any modifications. 

6.4. Approximation errors. As mentioned in Remark 6.2, factorizations such as (6.2), (6.3), 
(6.4), (6.6) are in practice only required to hold to within some preset tolerance. In a single-level 
scheme, it is straight-forward to choose a local tolerance in such a way that ||A — A approx || < e 
holds to within some given global tolerance e. In a multi-level scheme, it is rather difficult to 
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predict how errors aggregate across levels, in particular when the basis matrices U T and V r are 
not orthonormal. As an empirical observation, we have found that such error propagation is 
typically very mild and the error ||A — A a pp rox || is very well predicted by the local tolerance in 
the interpolatory decomposition. 

While there are as of yet no a priori error guarantees, it is often possible to produce highly 
accurate a posteriori error estimates. To this end, let q = A -1 / and q a pprox = A approx / denote 
the exact and the approximate solution, respectively. Then 

llQapprox — q\\ = ||A ap p rox Aq — A a p prox A approx q\ | < ||A ap p rox || ||A — A approx || \ \q\\. 

We can very rapidly and accurately estimate ||A~ prax || via a power iteration since we have access 
to a fast matrix- vector multiply for A~ pprox . The factor ||A — A approx || can similarly be estimated 
whenever we have access to an independent fast matrix- vector multiplication for A. For all BIEs 
discussed in this paper, the Fast Multipole Method can serve this function. If it is found that 
the factor ||A ap x || ||A — A approx || is larger than desired, the compression can be rerun with a 
smaller local tolerance. (The argument in this section assumes that the inversion is truly exact; 
a careful error analysis must also account for propagation of round-off errors.) 



7. Numerical examples 

The performance of the direct solver was tested on linear systems arising upon the Nystrom 
discretization of the BIE 

(7.1) \ q{x) + / n{ *'] ' ~ f } q(x') dl(x') = f(x), x G T, 

where n(y) is a unit normal to the contour T. Equation (7.1) is a standard BIE representation 
of the Laplace equation on a domain bordered by T when Dirichlet boundary data is specified 
[19, Sec. 6.3]. Three geometries were considered: 

• Smooth star: The star domain illustrated in Figure 7.1(a) is discretized via Gauss- 
ian quadrature as described in Section 2.3. In the experiment, we fix the size of the 
computational domain and increase the number of discretization points. This problem 
is artificial in the sense that the largest experiments use far more quadrature points 
than what is required for any reasonable level of accuracy. It was included simply to 
demonstrate the asymptotic scaling of the method. 

• Star with corners: The contour T consists of ten segments of circles with different 
radii as illustrated in Figure 7.1(b). We started with a discretization with 6 panels per 
segment and 17 Gaussian quadrature nodes per panel. Then grid refinement as described 
in [17, 5] (with a so called "simply graded mesh") was used to increase the number of 
discretization points, cf. Remark 2.2. 

• Snake: The contour T consists of two sine waves with amplitude 1 that are vertically 
separated by a distance of 0.2, see Figure 7.1(c). At the end points, two vertical straight 
lines connect the waves. Composite Gaussian quadrature was used with 25 nodes per 
panel, four panels on each vertical straight line (refined at the corners to achieve 10 digits 
of accuracy), and then 10 panels were used per wave-length. The width of the contour 
was increase from 2 full periods of the wave to 200, and the number of discretization 
points N was increased accordingly. 

A compressed representation of the coefficient matrix was for each example computed via 
Matlab implementations of the compression scheme described in Section 6. Then Fortran 77 
implementations of Algorithms 3 (inversion of an HBS matrix), 5 (conversion to standard HBS 
format), and 2 (matrix- vector multiply) were used to directly solve the respective linear systems. 
All codes were executed on a desktop computer with 2.8GHz Intel i7 processor and 2GB of 
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(a) 



(b) 



(c) 



Figure 7.1. The contours T used in the numerical experiments in Section 7. 
(a) Smooth star, (b) Star with corners, (c) Snake. 



RAM. The speed of the compression step can be improved significantly by moving to a Fortran 
implementation, but since this step is somewhat idiosyncratic to each specific problem, we believe 
that is representative to report the times of an unoptimized Matlab code. 

For all experiments, the local tolerance in the compression step was set to e = 10~ 10 . (In 
other words, the interpolatory factorization in (6.6) was required to produce a residual in the 
Frobenius norm no larger than e.) 

To asses the accuracy of the direct solver, the error ||A — A a pp rox || was estimated via a power 
iteration after each compression (the matrix A and its transpose were applied via the classical 
FMM [13] run at very high accuracy). The quantity ||A~. p rox || was also estimated. The results 
are reported in Figure 7.3. The quantities reported bound the overall error in the direct solver, 
see 6.4. 

8. Extensions and future work 

While the numerical examples in this paper concerned only BIEs associated with Laplace's 
equation, then method can readily be extended to other elliptic problems, as long as the kernel 
is not too oscillatory. The extension to Helmholtz equation was reported in [21] and [22]. 

The direct solver described can also be applied to BIEs defined on surfaces in M 3 without al- 
most any modifications [12]. The complexity then grows from O(N) to 0(N 1 ' 5 ) for the inversion 
step, while O(N) complexity is retained for applying the computed inverse. The complexity of 
the inversion step can be improved by doing a "recursion on dimension" in a manner similar 
to the recently described O(N) nested dissection schemes reported in [25, 7, 11]. This work is 
currently in progress. 

The error analysis in the present paper is very rudimentary. Numerical experiments indicate 
that the method is stable and accurate, but this has not yet been demonstrated in any case 
other than that of symmetric positive definite matrices. 

Acknowledgements: The work reported was supported by NSF grants DMS0748488 and 
DMS0941476. 
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